Language underpins nearly all forms of social interaction.
Until recently, analyzing these interactions quantitatively was a challenge.
Large quantities of text are now available in the form of digitized books and other texts.
We now also have powerful methods to analyze these texts.
Throughout the rest of the semester, we will use different methods to:
Assign numbers to words and documents to measure latent concepts in text.
So, we want to assign numbers that enable us to measure latent concepts in large corpora of text.
Latent means that we cannot observe these concepts directly; we have to infer them from observed text.
Thus, we need to find strategies to score words and documents in a corpus.
Examples:
1. When did Western political thought start diverging from Islamic political thought?
2. How do central bankers make decisions on economic policy?
3. How has the cultural meaning of words changed over time?
4. How can we detect online hate speech?
1. Language models are wrong, but some are useful
2. We need to validate text-analysis insights with domain knowledge
3. We need to combine quantitative and qualitative insights
Typically, text analysis entails more steps:
A document-feature matrix is a typical way of representing text in quantitative form.
The rows of the matrix represent the documents.
The columns indicate the features (e.g. words).
We need to make decisions about which documents and features are important to build a document-feature matrix.
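To make the row-and-column layout concrete, here is a minimal sketch of a document-feature matrix built with the Python standard library only. The three documents are made up for illustration, and splitting on whitespace is an assumed, deliberately naive way to obtain features; real projects usually rely on a dedicated text-analysis library.

```python
from collections import Counter

# Toy corpus: three short, made-up documents (illustrative only).
corpus = [
    "the bank raised interest rates",
    "the central bank cut rates",
    "rates rates rates",
]

# Naive feature extraction: lowercase each document and split on whitespace.
tokenized = [doc.lower().split() for doc in corpus]

# Features (columns): the sorted vocabulary of all distinct words.
features = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per feature; each cell is a word count.
dfm = []
for doc in tokenized:
    counts = Counter(doc)
    dfm.append([counts[feature] for feature in features])

print(features)
for row in dfm:
    print(row)
```

Each row is one document and each column one word, so the third document, which contains only "rates", has a single non-zero cell. Dropping a column (for example "the") or merging rows are exactly the kinds of decisions about documents and features mentioned above.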
A document is the basic unit of text analysis.
A corpus is a structured set of documents for analysis.
Tokenization is the process of breaking down a piece of text, like a sentence or a paragraph, into individual words or “tokens.”
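As a rough illustration of tokenization, the sketch below lowercases a sentence and extracts word-like tokens with a regular expression. The function name `tokenize`, the example sentence, and the rule itself (runs of letters, digits, and apostrophes) are assumptions made for this example; other tokenizers draw the boundaries differently.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sentence = "Central bankers don't decide alone."
tokens = tokenize(sentence)
print(tokens)
```

Note that this rule keeps "don't" as one token and drops the final period. Whether punctuation, contractions, or numbers count as tokens is a modeling choice, not a fixed fact.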
Documents, the basic unit of text analysis, can be:
1. A collection of literary works
2. A novel
3. A chapter
4. A tweet or a message
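The choice of document unit determines how many rows the document-feature matrix will have. The sketch below takes the same made-up text and treats it as one collection, as separate novels, or as separate chapters. The titles, the text, and the `---` chapter separator are all hypothetical conventions chosen for this example.

```python
# A made-up collection of two "novels"; chapters are separated by "---".
collection = {
    "Novel A": "First chapter of A.\n---\nSecond chapter of A.",
    "Novel B": "Only chapter of B.",
}

# Option 1: the whole collection is a single document.
as_collection = ["\n".join(collection.values())]

# Option 2: each novel is a document.
as_novels = list(collection.values())

# Option 3: each chapter is a document.
as_chapters = [
    chapter.strip()
    for novel in as_novels
    for chapter in novel.split("---")
]

# Number of documents (matrix rows) under each choice.
print(len(as_collection), len(as_novels), len(as_chapters))
```

The text is identical in all three cases; only the unit of analysis changes, and with it the number of rows available for comparison.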